ABSTRACT

This paper proposes a deep denoising auto-encoder technique to
extract better acoustic features for speech synthesis. The technique
allows us to automatically extract low-dimensional features from
high-dimensional spectral features in a non-linear, data-driven,
unsupervised way. We compared the new stochastic feature extractor
with conventional mel-cepstral analysis in analysis-by-synthesis
and text-to-speech experiments. Our results confirm that the
proposed method increases the quality of synthetic speech in both
experiments.

Index Terms — Speech synthesis, HMM, DNN, Auto-encoder 

1. INTRODUCTION 

Current statistical parametric speech synthesis typically uses hidden
Markov models (HMMs) to represent probability densities of speech
trajectories given text [?]. This is a well-established method, and it
is straightforward to apply this framework to new languages. It
also offers interesting advantages in terms of flexibility and compact
footprint [?]. It is known, however, that speech synthesized
from statistical models still sounds somewhat artificial and less
natural compared to speech synthesized by the best unit selection
systems.

It is often said that averaging in statistical synthesis systems
partly removes the spectral fine structure of natural speech, and thus
there is room for improving the segmental quality. A stochastic
postfilter approach [?] proposes to use a deep neural network
(DNN) to model the conditional probability of the spectral
differences between natural and synthetic speech. The approach is able
to reconstruct the spectral fine structure lost during modeling and
has achieved a significant quality improvement for synthetic speech
[?]. In this experiment, the HMM-based speech synthesiser was
trained in the mel-cepstral domain, while the DNN-based postfilter
was trained in the spectral domain.

This indicates that current statistical parametric speech synthesis
suffers from quality loss due not only to statistical averaging in the
mel-cepstral domain, but also to the conversion from high-dimensional
spectral features to lower-dimensional mel-cepstral parameters.
This raises a new question: are current intermediate representations
such as mel-cepstral coefficients appropriate for statistical
training of acoustic models? Can we automatically find a more
appropriate intermediate representation that suits acoustic modelling
and results in better quality of synthetic speech?

To answer this question, this paper proposes a DNN-based feature
extraction method. More specifically, we propose to use a deep
denoising auto-encoder technique as a non-linear robust feature
extractor for speech synthesis and apply it to high-dimensional spectral
features obtained from the STRAIGHT vocoder [?]. We compare this
data-driven, unsupervised feature extraction approach with
conventional mel-cepstral analysis, which is based on a linear discrete
cosine transform of the log spectrum.

This paper is organised as follows: in Section 2, we outline
related DNN-based approaches, and in Section 3 we describe the
proposed deep denoising auto-encoder technique. Section 4 explains
how we train the model, and Section 5 presents the experimental
conditions and evaluation results. Discussion and a summary of our
findings are given in Section 6.

2. RELATED WORK USING DNN AND AUTO-ENCODER 

This section overviews related work using DNNs and/or auto-encoders
in the speech information processing field. DNNs have been applied
to acoustic modelling for speech synthesis. For instance, [8] uses a
DNN to learn the relationship between input texts and the extracted
features instead of decision tree-based state tying. Restricted
Boltzmann machines or deep belief networks have been used for
modelling output probabilities of HMM states instead of GMMs [9].
Recurrent neural networks or long short-term memory have been used
for prosody modelling [10] or acoustic trajectory modelling [?].

To the best of our knowledge, this is the first work to use a deep
denoising auto-encoder for speech synthesis. However, deep auto-encoder
based bottleneck features have been used by several groups for ASR
[?], and the deep denoising auto-encoder has also been verified for
noise-robust ASR [?] and reverberant ASR tasks [?].

Techniques that are closely related to this paper are a spectral
binary coding approach using a deep auto-encoder proposed by Deng
et al. [?] and a speech enhancement approach using a deep denoising
auto-encoder that tries to reconstruct the clean spectrum from a
noisy spectrum [?]. The approach proposed here is also related
to heteroscedastic linear discriminant analysis (HLDA) and
probabilistic linear discriminant analysis (PLDA) [21][22][23]. Our
key idea is, however, different from these, as we use deep auto-encoder
based continuous bottleneck features calculated from the spectrum to
reconstruct high-quality synthetic speech.

3. AUTO-ENCODER 
3.1. Basic Auto-encoder 

An auto-encoder is an artificial neural network that is generally used
for learning a compressed and distributed representation of a dataset.
It consists of an encoder and a decoder. The encoder maps an input
vector $x$ to a hidden representation $y$ as follows:

\[ y = f_{\theta}(x) = s(Wx + b), \qquad (1) \]

where $\theta = \{W, b\}$. $W$ and $b$ represent an $m \times n$ weight
matrix and a bias vector of dimensionality $m$ respectively, where $n$
is the dimension of $x$. The function $s$ is a non-linear transformation
applied to the linear mapping $Wx + b$. Frequently, $s$ is a sigmoid,
tanh, or ReLU function. $y$, the output of the encoder, is then mapped
to $z$, the output of the decoder. The mapping is performed either by a
linear function alone that employs an $n \times m$ weight matrix $W'$
and a bias vector $b'$ of dimensionality $n$ as follows:

\[ z = g_{\theta'}(y) = W'y + b', \]

or by a linear mapping followed by a non-linear transformation $t$:

\[ z = g_{\theta'}(y) = t(W'y + b'), \]

where $\theta' = \{W', b'\}$. The weight for the decoding is set as the
transpose of the encoding weight in order to allow more layers
to be stacked together and be fine-tuned with stochastic gradient
descent (SGD).

In general, the output $z$ should be interpreted as a function of the
parameters $\{\theta, \theta'\}$ as $z = g_{\theta'}(f_{\theta}(x))$.
The parameters $\{\theta, \theta'\}$ are optimized such that the
reconstructed $z$ is as close as possible to the original $x$ and
maximizes $P(x|z)$. A typical loss function is the mean square error
(MSE), i.e. $L(x, z) = \|x - z\|^2$.

Fig. 1. Greedy layer-wise pre-training for constructing a deep
auto-encoder.
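
The basic auto-encoder of Section 3.1 can be illustrated with a minimal NumPy sketch. It assumes tanh for both $s$ and $t$ and ties the decoder weight to the transpose of the encoder weight, as the text describes. The function names, the toy dimensions and the random seed are illustrative choices, not taken from the paper.

```python
import numpy as np

def encode(x, W, b):
    """Encoder y = s(Wx + b) with s = tanh; W has shape (m, n)."""
    return np.tanh(W @ x + b)

def decode(y, W, b_prime, nonlinear=True):
    """Decoder z = t(W'y + b') with tied weights W' = W^T.

    With nonlinear=False the decoder is the purely linear mapping.
    """
    t = W.T @ y + b_prime
    return np.tanh(t) if nonlinear else t

def squared_error(x, z):
    """Reconstruction loss L(x, z) = ||x - z||^2."""
    return float(np.sum((x - z) ** 2))

rng = np.random.default_rng(0)        # arbitrary seed (assumption)
n, m = 8, 3                           # toy sizes, far smaller than 2049 -> 120
W = rng.normal(scale=0.1, size=(m, n))
b, b_prime = np.zeros(m), np.zeros(n)

x = rng.uniform(-1.0, 1.0, size=n)    # stand-in for one spectral frame
y = encode(x, W, b)                   # hidden representation
z = decode(y, W, b_prime)             # reconstruction of x
loss = squared_error(x, z)
```

Tying the weights halves the number of weight parameters per layer, so only `W`, `b` and `b_prime` need to be learned for each encoder/decoder pair.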

3.2. Denoising Auto-encoder

The denoising auto-encoder is a variant of the basic auto-encoder.
It is reported that the denoising auto-encoder can extract features
more robustly than the basic auto-encoder [?]. In the denoising
auto-encoder, the original data $x$ is first corrupted to $\tilde{x}$
before it is mapped to a higher representation $f_{\theta}(\tilde{x})$
by an encoder. The decoder then maps the higher representation to the
output $z$ for reconstructing the original $x$. The denoising
auto-encoder is trained such that the reconstructed $z$ is as close as
possible to the original data $x$. Note that it is only during training
that the denoising auto-encoder is used to reconstruct the original $x$
from the corrupted $\tilde{x}$.

3.3. Deep Auto-encoder

An auto-encoder or denoising auto-encoder can be made deeper by
stacking multiple layers of encoders and decoders to form a deep
architecture. [?] shows that a deeper architecture produces better
high-level features compared to a shallow architecture, up to 4
encoding and 4 decoding layers. For constructing a deep auto-encoder,
pre-training is widely used. In pre-training, the number of layers
in a deep auto-encoder grows twice as fast as in a deep neural
network (DNN) when stacking each pre-trained unit, since each unit
contributes both an encoding and a decoding layer. It is reported
that fine-tuning with back-propagation through a deep auto-encoder
is ineffective due to vanishing gradients at the lower layers [?]. To
overcome this issue, we restrict the decoding weight to be the transpose
of the encoding weight following [?], that is, $W' = W^{T}$, where
$W^{T}$ denotes the transpose of $W$. We describe the details of
training a deep auto-encoder in the next section.

4. TRAINING A DEEP DENOISING AUTO-ENCODER
4.1. Greedy Layer-wise Pre-training

Each layer of a deep auto-encoder can be pre-trained greedily to
minimize the reconstruction loss $L(x, z)$ of the data locally.
Figure 1 shows a procedure for constructing a deep auto-encoder using
pre-training. In pre-training, a 1-hidden-layer auto-encoder is trained
and the encoding output of the locally trained layer is used as the
input for the next layer. This layer-wise training is repeated until
the desired number of layers is obtained. The encoding, decoding and
loss functions of each layer are represented as follows:

Layer 1:

\[ y_1 = f_{W_1, b_1}(x), \]
\[ z_1 = g_{W'_1, b'_1}(y_1), \]
\[ L(x, z_1) = \|x - z_1\|^2. \]

Layer k (k > 1):

\[ y_k = f_{W_k, b_k}(y_{k-1}), \]
\[ z_k = g_{W'_k, b'_k}(y_k), \]
\[ L(y_{k-1}, z_k) = \|y_{k-1} - z_k\|^2. \]

Note that during the pre-training of the deep denoising auto-encoder,
the inputs $x$ and $y_k$ for each layer are corrupted to $\tilde{x}$
and $\tilde{y}_k$ respectively. After all layers are pre-trained, all
the pre-trained layers are stacked for constructing a deep denoising
auto-encoder in the same way as the deep auto-encoder.

4.2. Fine-tuning

The purpose of fine-tuning is to minimize the reconstruction error
$L(x, z)$ over the entire dataset and the whole model architecture
using error back-propagation [?]. We use the mean square error (MSE)
as the loss function of a deep auto-encoder, and it is represented as
follows:

\[ E = \sum_{i=1}^{N} \|x^{(i)} - z^{(i)}\|^2, \qquad (5) \]

where $N$ is the total number of training examples. The partial
derivative w.r.t. the weight $w_{ij}^{(l)}$ is represented as follows:

\[ \frac{\partial E}{\partial w_{ij}^{(l)}}
   = \frac{\partial E}{\partial t_j^{(l)}}
     \frac{\partial t_j^{(l)}}{\partial w_{ij}^{(l)}}, \]

where $t_j^{(l)}$ is the fan-in input to neuron $j$ in layer $l$, and
$\partial t_j^{(l)} / \partial w_{ij}^{(l)} = o_i^{(l-1)}$, where
$o_i^{(l-1)}$ is the output from neuron $i$ at layer $l-1$.
$\partial E / \partial t_j^{(l)}$ is the error transfer function, which
can be calculated recursively following

\[ \frac{\partial E}{\partial t_i^{(l)}}
   = \sum_j \frac{\partial E}{\partial t_j^{(l+1)}}
     \frac{\partial t_j^{(l+1)}}{\partial o_i^{(l)}}
     \frac{\partial o_i^{(l)}}{\partial t_i^{(l)}}, \qquad (7) \]

where $\partial t_j^{(l+1)} / \partial o_i^{(l)} = w_{ij}^{(l+1)}$. For
the output tanh layer we have
$\partial o_j^{(l)} / \partial t_j^{(l)} = \mathrm{sech}^2(t_j^{(l)})$.
Once we have the gradients of the error function w.r.t. the weight
parameters, we can fine-tune the network with error back-propagation.

[Figure 2 panels: (a) Original, (b) Masked; horizontal axis: frame number]

Fig. 2. These figures show parts of original and masked spectra. In
the right figure, black points indicate masked regions.

Fig. 3. Reconstruction mean square errors for auto-encoders of
different architectures but the same bottleneck dimension.

Table 1. Hyperparameters used for training each model. lr: learning
rate, m: momentum, b: batch size, s: NumPy random seed for weight
initialization [?], d: masking probability of each input dimension [?].
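
As a hedged illustration of the back-propagation recursion and of greedy layer-wise pre-training, the following NumPy sketch computes the gradients of the squared reconstruction error for one tied-weight tanh denoising auto-encoder and stacks several such layers. It uses plain full-batch gradient descent rather than SGD with momentum, and the function names, learning rate, epoch count and masking probability are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def dae_loss_and_grads(X, Xc, W, b, bp):
    """Squared reconstruction error and its gradients for one layer.

    X: clean inputs (N, n); Xc: corrupted inputs (N, n);
    W: tied weight (m, n); b: encoder bias (m,); bp: decoder bias (n,).
    """
    Y = np.tanh(Xc @ W.T + b)            # encoder output
    Z = np.tanh(Y @ W + bp)              # decoder output with W' = W^T
    E = np.sum((X - Z) ** 2)
    # Error transfer at the output layer: dE/dt = dE/do * sech^2(t),
    # using sech^2(t) = 1 - tanh^2(t).
    dz = 2.0 * (Z - X) * (1.0 - Z ** 2)
    # Recursion to the hidden layer through the decoder weights.
    dy = (dz @ W.T) * (1.0 - Y ** 2)
    # The tied weight receives gradient from both decoder and encoder.
    gW = Y.T @ dz + dy.T @ Xc
    return E, gW, dy.sum(axis=0), dz.sum(axis=0)

def pretrain_layer(X, m, d, lr, epochs, rng):
    """Train one denoising auto-encoder layer on inputs X."""
    n = X.shape[1]
    W = rng.normal(scale=0.1, size=(m, n))
    b, bp = np.zeros(m), np.zeros(n)
    for _ in range(epochs):
        Xc = X * (rng.random(X.shape) >= d)   # masking noise
        _, gW, gb, gbp = dae_loss_and_grads(X, Xc, W, b, bp)
        W -= lr * gW / len(X)
        b -= lr * gb / len(X)
        bp -= lr * gbp / len(X)
    return W, b, bp

def greedy_pretrain(X, sizes, d=0.1, lr=0.1, epochs=200, seed=0):
    """Pre-train layers one at a time; each feeds the next with its encoding."""
    rng = np.random.default_rng(seed)
    layers, H = [], X
    for m in sizes:
        W, b, bp = pretrain_layer(H, m, d, lr, epochs, rng)
        layers.append((W, b, bp))
        H = np.tanh(H @ W.T + b)              # clean encoding for next layer
    return layers
```

After pre-training, the stacked encoders and their transposed decoders would be fine-tuned jointly by applying the same recursion through all layers. The sketch stops at pre-training to stay short.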


4.3. Corrupted data 

We used the masking technique reported in [25] to corrupt the training
data for the denoising auto-encoder. This technique independently
and randomly sets the values of the training data in different
dimensions to zero following a Bernoulli distribution. Figure 2 shows
an example of original and masked spectra. In this figure, black points
indicate masked regions.
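
The masking corruption can be sketched in a few lines of NumPy. The function name `mask_corrupt` is a hypothetical helper, and `d` denotes the per-dimension masking probability.

```python
import numpy as np

def mask_corrupt(x, d, rng):
    """Set each element of x to zero independently with probability d.

    Returns the corrupted array and the boolean keep-mask that was drawn.
    """
    keep = rng.random(x.shape) >= d   # True with probability 1 - d
    return x * keep, keep

rng = np.random.default_rng(0)        # arbitrary seed (assumption)
frames = np.ones((1000, 50))          # toy stand-in for spectral frames
corrupted, keep = mask_corrupt(frames, 0.3, rng)
```

A fresh mask is drawn every time the function is called, so each training pass sees a different corruption of the same frame while the reconstruction target stays clean.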


5. EVALUATION

This section shows experimental results. We evaluated the proposed
auto-encoder method under analysis-by-synthesis and text-to-speech
conditions. In the text-to-speech experiments, the synthetic voices
using the proposed acoustic features were modeled using two
state-of-the-art speech synthesis systems: HMM and DNN.

5.1. Dataset

The dataset we use consists of 4569 short audio waveforms uttered
by a professional English female speaker, and each waveform is
around 5 seconds long. For each waveform, we first extract its
frequency spectra using the STRAIGHT vocoder with 2049 FFT points.
We then extract the low-dimensional feature from each 2049-dim
STRAIGHT spectrum using the auto-encoder. All data was sampled
at 48 kHz. For comparison with the proposed method, we extracted
mel-cepstral coefficients with the same number of dimensions as the
auto-encoder features. All other acoustic features, such as log F0 and
25 aperiodicity band energies, are the same for all the systems.

5.2. Configurations of the deep denoising auto-encoder

Figure 3 shows the reconstruction mean square errors of
auto-encoders trained on raw frequency-warped spectra with different
numbers of hidden layers. It shows that the error decreases with
more hidden layers, and that a deep auto-encoder is better than a
shallow auto-encoder with the same bottleneck dimension. For the
results in the rest of the paper, we use a 2049-500-180-120
auto-encoder architecture for producing the 120-dim acoustic features,
with tanh units for all the layers. The inputs are 2049-dim
Bark-scale-based frequency-warped spectra, which are preprocessed with
global contrast normalization. The hyperparameters used for the
layer-by-layer pre-training are searched randomly, and the set of
values that produces the best results is selected. Table 1 shows the
hyperparameters for the auto-encoders used in the experiments.
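
To make the configuration of the auto-encoder concrete, the sketch below applies a per-frame global contrast normalization and passes the result through a 2049-500-180-120 tanh encoder. The paper does not state which variant of global contrast normalization it uses. Subtracting each frame's mean and dividing by its standard deviation is one common form and is an assumption here, as are the function names and the random (untrained) weights.

```python
import numpy as np

LAYER_SIZES = [2049, 500, 180, 120]   # architecture stated in the text

def global_contrast_normalize(X, eps=1e-8):
    """Per-frame GCN: remove each row's mean, scale by its standard deviation."""
    Xc = X - X.mean(axis=1, keepdims=True)
    scale = np.sqrt((Xc ** 2).mean(axis=1, keepdims=True))
    return Xc / np.maximum(scale, eps)

def init_encoder(sizes, seed=0):
    """Random placeholder weights; real weights come from training."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=0.01, size=(m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def extract_features(X, params):
    """Run frames through the stacked tanh encoders down to the bottleneck."""
    H = X
    for W, b in params:
        H = np.tanh(H @ W.T + b)
    return H

rng = np.random.default_rng(0)
spectra = rng.random((4, 2049))               # toy stand-in for warped spectra
normalized = global_contrast_normalize(spectra)
features = extract_features(normalized, init_encoder(LAYER_SIZES))
```

Each input frame of 2049 values is reduced to a 120-dimensional vector, which is the acoustic feature passed to the HMM or DNN acoustic model.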


5.3. Analysis-by-synthesis experimental results 

First we report the analysis-by-synthesis experimental results. For
this evaluation, we divided the above database into three subsets,
that is, training, validation and test. The training subset was
used as training data for building the auto-encoder, the validation
subset was used as a stopping criterion during training to prevent
overfitting, and the test subset was used for measuring log-spectral
distortion and for the listening test.

Figure 4 shows the original and reconstructed spectra using each
technique (mel-cepstral analysis, deep auto-encoder, deep denoising
auto-encoder). We can clearly see that the deep auto-encoders
reconstruct high-frequency parts more precisely than mel-cepstral
analysis. Figure 5 shows the log spectral distortion between the
original spectra and reconstructed spectra, calculated on the test
subset. We can observe that the deep auto-encoder reduced the
distortion significantly compared to mel-cepstral analysis, and the
denoising version further reduced the distortion. Figure 6 shows
subjective preference scores of these methods. Seven listeners
performed this test. They participated in two preference tests. In the
first preference test, they were asked to compare the deep auto-encoder
(DA) with mel-cepstral analysis (MCEP). In the second preference test,
they were asked to compare the deep auto-encoder with the deep
denoising auto-encoder (DDA). From the figure, we can see that deep
auto-encoder based speech samples sound more natural than mel-cepstral
analysis based speech samples. The deep denoising auto-encoder reduced
the distortion; however, the perceptual difference between the clean
and denoising auto-encoders is not statistically significant.

Fig. 4. Original and reconstructed spectra using each technique.

Fig. 5. Log spectral distortion between the original and reconstructed
spectra.

Fig. 6. Results of preference tests using analysis-by-synthesis
speech samples. In this figure, MCEP, DA and DDA refer to
mel-cepstrum analysis, deep auto-encoder and deep denoising
auto-encoder respectively.
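
The paper does not spell out its formula for log spectral distortion. A common definition, assumed here, is the root-mean-square difference in dB between the log-magnitude spectra of each frame, averaged over frames. The function name and the flooring constant are illustrative.

```python
import numpy as np

def log_spectral_distortion(S_ref, S_rec, floor=1e-10):
    """Average per-frame RMS difference (in dB) between two spectra.

    S_ref, S_rec: non-negative magnitude spectra of shape (frames, bins).
    Values are floored before the logarithm to avoid log(0).
    """
    ref_db = 20.0 * np.log10(np.maximum(S_ref, floor))
    rec_db = 20.0 * np.log10(np.maximum(S_rec, floor))
    per_frame = np.sqrt(np.mean((ref_db - rec_db) ** 2, axis=1))
    return float(per_frame.mean())

rng = np.random.default_rng(0)
S = rng.random((3, 16)) + 0.1                 # toy magnitude spectra
lsd_same = log_spectral_distortion(S, S)
lsd_scaled = log_spectral_distortion(S, 10.0 * S)
```

Under this definition, identical spectra give zero distortion, and a uniform tenfold amplitude scaling corresponds to a 20 dB difference in every bin. If the spectra are power rather than magnitude values, the factor 20 would be replaced by 10.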


5.4. Text-to-speech experimental results 

Next we report the text-to-speech experimental results. For the
HMM-based speech synthesis, we used a hidden semi-Markov
model, and the observation vectors for the spectral and excitation
parameters contained static, delta and delta-delta values, with one
stream for the spectrum, three streams for F0 and one for the
band-limited aperiodicity. The context-dependent labels are built using
the pronunciation lexicon Combilex [30]. For the DNN-based speech
synthesis, we trained a five-hidden-layer DNN for mapping
between linguistic contexts and auto-encoder-based or mel-cepstral
acoustic features. The number of units in each of the hidden layers
was set to 512. Random initialisation was used in a similar way
to [?]. Figure 7 shows subjective preference scores where we
compared the proposed auto-encoder feature with the conventional
mel-cepstral feature in each of the HMM-based and DNN-based speech
synthesis systems. The listeners are the same as those for Figure 6.
We can see that synthetic speech using the proposed feature sounds
more natural than that using the conventional mel-cepstral features
in both synthesis methods. The proposed feature seems to suit
DNN-based speech synthesis better, but this requires further
investigation.

Fig. 7. Results of preference tests using text-to-speech samples. In
this figure, MCEP and DA refer to mel-cepstrum analysis and deep
auto-encoder for the acoustic feature extraction, and HMM and DNN
are the acoustic models.

6. CONCLUSIONS 

In this paper we have proposed the deep denoising auto-encoder
technique to extract better acoustic features for speech synthesis. We
have compared the new stochastic feature extractor with conventional
mel-cepstral analysis in analysis-by-synthesis and text-to-speech
experiments and have confirmed that the proposed method
can increase the quality of synthetic speech in both conditions.

Our future work includes the improvement of the deep denoising
auto-encoder. In this paper, we have used the simplest noise,
i.e. masking, and the improvement was observed only in the objective
evaluation. We shall use or design different types of noise to
improve the deep denoising auto-encoder for speech synthesis further.